This report explores a Prosper loan dataset that contains peer-to-peer loan listing details of consumers with Prosper. We start with exploratory analysis on variables of interest, and then raise some deeper questions on the underlying relationship between variables. Finally, we build a predictive linear model for borrower’s interest rate and justify the output.
The prosper data set consists of 113,937 observations and 81 variables. The data is on the level of ListingNumber of each loan. The first step of analysis is to import the data from csv file to R dataframe. Then we get a better understanding of the data by looking at the variable list, variable types, the print-out and the distribution of variables, and the unique count and missing count of selective variables such as borrower rate, term, prosper score, loan amount, credit history etc.
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113048 levels "00003546482094282EF90E5",..: 7180 7192 6492 6514 6531 6534 6544 6551 6532 6532 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 34801 levels "00:00.0","00:00.1",..: 5431 16274 452 1489 22348 15402 30617 28572 25250 25250 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","02:17.6","02:25.0",..: 2384 1 802 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : int NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : logi TRUE FALSE FALSE TRUE TRUE TRUE ...
## $ CurrentlyInGroup : logi TRUE FALSE TRUE FALSE FALSE FALSE ...
## $ GroupKey : Factor w/ 707 levels "","00943382969547936B0C529",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 100278 levels "00:00.0","00:00.2",..: 63351 43901 2975 15398 94325 29949 58623 70468 31274 31274 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1/1/1951 0:00",..: 1072 5084 9271 4506 4275 4773 11336 4423 5797 5797 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : int 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : int 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : int 472 0 NA 10056 0 0 0 0 0 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : int 0 3989 NA 1444 6193 62999 5812 1260 9906 9906 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : int 1500 10266 NA 30754 695 86509 1929 2181 77696 77696 ...
## $ TotalTrades : int 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : int 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0 ","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113044 levels "00003683605746079487FF7",..: 100315 69815 46242 70754 71365 86483 91228 5294 881 881 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "1/10/2006 0:00",..: 1729 880 35 314 1783 541 974 1099 476 476 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90824 levels "00003397697413387CAF966",..: 10981 10212 33733 54932 19440 48049 60441 40985 26069 26069 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
## ListingNumber Term BorrowerRate EstimatedReturn ProsperScore
## 1 193129 36 0.1580 NA NA
## 2 1209647 36 0.0920 0.05470 7
## 3 81716 36 0.2750 NA NA
## 4 658116 36 0.0974 0.06000 9
## 5 909464 36 0.2085 0.09066 4
## 6 1074836 60 0.1314 0.07077 10
## EmploymentStatus IsBorrowerHomeowner CurrentCreditLines TotalInquiries
## 1 Self-employed TRUE 5 3
## 2 Employed FALSE 14 5
## 3 Not available FALSE NA 1
## 4 Employed TRUE 5 1
## 5 Employed TRUE 19 9
## 6 Employed TRUE 21 2
## CurrentDelinquencies AmountDelinquent BankcardUtilization
## 1 2 472 0.00
## 2 0 0 0.21
## 3 1 NA NA
## 4 4 10056 0.04
## 5 0 0 0.81
## 6 0 0 0.39
## DebtToIncomeRatio StatedMonthlyIncome LoanOriginalAmount
## 1 0.17 3083.333 9425
## 2 0.18 6125.000 10000
## 3 0.06 2083.333 3001
## 4 0.15 2875.000 10000
## 5 0.26 9583.333 15000
## 6 0.36 8333.333 15000
## LoanOriginationQuarter Investors CreditScoreRangeLower
## 1 Q3 2007 258 640
## 2 Q1 2014 1 680
## 3 Q1 2007 41 480
## 4 Q4 2012 158 800
## 5 Q3 2013 20 680
## 6 Q4 2013 1 740
## CreditScoreRangeUpper
## 1 659
## 2 699
## 3 499
## 4 819
## 5 699
## 6 759
## ListingNumber Term BorrowerRate EstimatedReturn
## Min. : 4 Min. :12.00 Min. :0.0000 Min. :-0.183
## 1st Qu.: 400919 1st Qu.:36.00 1st Qu.:0.1340 1st Qu.: 0.074
## Median : 600554 Median :36.00 Median :0.1840 Median : 0.092
## Mean : 627886 Mean :40.83 Mean :0.1928 Mean : 0.096
## 3rd Qu.: 892634 3rd Qu.:36.00 3rd Qu.:0.2500 3rd Qu.: 0.117
## Max. :1255725 Max. :60.00 Max. :0.4975 Max. : 0.284
## NA's :29084
## ProsperScore EmploymentStatus IsBorrowerHomeowner
## Min. : 1.00 Employed :67322 Mode :logical
## 1st Qu.: 4.00 Full-time :26355 FALSE:56459
## Median : 6.00 Self-employed: 6134 TRUE :57478
## Mean : 5.95 Not available: 5347 NA's :0
## 3rd Qu.: 8.00 Other : 3806
## Max. :11.00 : 2255
## NA's :29084 (Other) : 2718
## CurrentCreditLines TotalInquiries CurrentDelinquencies
## Min. : 0.00 Min. : 0.000 Min. : 0.0000
## 1st Qu.: 7.00 1st Qu.: 2.000 1st Qu.: 0.0000
## Median :10.00 Median : 4.000 Median : 0.0000
## Mean :10.32 Mean : 5.584 Mean : 0.5921
## 3rd Qu.:13.00 3rd Qu.: 7.000 3rd Qu.: 0.0000
## Max. :59.00 Max. :379.000 Max. :83.0000
## NA's :7604 NA's :1159 NA's :697
## AmountDelinquent BankcardUtilization DebtToIncomeRatio
## Min. : 0.0 Min. :0.000 Min. : 0.000
## 1st Qu.: 0.0 1st Qu.:0.310 1st Qu.: 0.140
## Median : 0.0 Median :0.600 Median : 0.220
## Mean : 984.5 Mean :0.561 Mean : 0.276
## 3rd Qu.: 0.0 3rd Qu.:0.840 3rd Qu.: 0.320
## Max. :463881.0 Max. :5.950 Max. :10.010
## NA's :7622 NA's :7604 NA's :8554
## StatedMonthlyIncome LoanOriginalAmount LoanOriginationQuarter
## Min. : 0 Min. : 1000 Q4 2013:14450
## 1st Qu.: 3200 1st Qu.: 4000 Q1 2014:12172
## Median : 4667 Median : 6500 Q3 2013: 9180
## Mean : 5608 Mean : 8337 Q2 2013: 7099
## 3rd Qu.: 6825 3rd Qu.:12000 Q3 2012: 5632
## Max. :1750003 Max. :35000 Q2 2012: 5061
## (Other):60343
## Investors CreditScoreRangeLower CreditScoreRangeUpper
## Min. : 1.00 Min. : 0.0 Min. : 19.0
## 1st Qu.: 2.00 1st Qu.:660.0 1st Qu.:679.0
## Median : 44.00 Median :680.0 Median :699.0
## Mean : 80.48 Mean :685.6 Mean :704.6
## 3rd Qu.: 115.00 3rd Qu.:720.0 3rd Qu.:739.0
## Max. :1189.00 Max. :880.0 Max. :899.0
## NA's :591 NA's :591
## ListingNumber Term BorrowerRate
## 113066 3 2294
## EstimatedReturn ProsperScore EmploymentStatus
## 1477 12 9
## IsBorrowerHomeowner CurrentCreditLines TotalInquiries
## 2 55 103
## CurrentDelinquencies AmountDelinquent BankcardUtilization
## 48 7060 202
## DebtToIncomeRatio StatedMonthlyIncome LoanOriginalAmount
## 1208 13502 2468
## LoanOriginationQuarter Investors CreditScoreRangeLower
## 33 751 27
## CreditScoreRangeUpper
## 27
## ListingNumber Term BorrowerRate
## 0 0 0
## EstimatedReturn ProsperScore EmploymentStatus
## 29084 29084 0
## IsBorrowerHomeowner CurrentCreditLines TotalInquiries
## 0 7604 1159
## CurrentDelinquencies AmountDelinquent BankcardUtilization
## 697 7622 7604
## DebtToIncomeRatio StatedMonthlyIncome LoanOriginalAmount
## 8554 0 0
## LoanOriginationQuarter Investors CreditScoreRangeLower
## 0 0 591
## CreditScoreRangeUpper
## 591
##
## 12 36 60
## 1614 87778 24545
Most Prosper loans have a term of 36 months, followed by 60 months. Very few loans have a 12-month term.
##
## Cancelled Chargedoff Completed Current Past Due
## 5 11992 38074 56576 7290
The first graph shows the number of Prosper loans by category of loan status. As there are multiple ‘Past Due’ columns that have low counts, we group them into one field as shown in the second graph. Naturally, most cases are current or completed. There are about 7,290 past-due loan cases, which constitutes about 6.4% of all loans with Prosper.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.0 0.0 984.5 0.0 463900.0 7622
##
## 0
## 89818
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 246 1067 6345 5061 463900
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.391 3.028 3.040 3.704 5.666
Almost 80% of borrowers, also 89,818 out of 113,937 don’t have delinquency. 7,622 of the rest are unavailable. Taking the log of AmountDelinquent, the plot shows a normal distribution with log10 value of 3 being the peak, which is a few thousand dollars on average.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 7.00 10.00 10.32 13.00 59.00 7604
The distribution of current credit lines skewed right, peaking around 7 to 10.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
The loan original amount is right skewed, having multiple peaks of $3,000, $10,000, and $15,000. Most loans are below the amount of $12,000.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
The borrower rate, also the interest rate of consumers peaks at about 3.2% with more than 5,500 cases. The range goes from 5% to 36% for most consumers. And the average value of borrower rate is 19.3%.
##
## 1 2 3 4 5 6 7 8 9 10 11
## 992 5766 7642 12595 9813 12278 10597 12053 6911 4750 1456
Prosper rate is a custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score’. There are some score outliers that are larger than the maximum score 10. Other than that, the Prosper score ‘prefers’ to take even values, 4, 6 and 8.
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 16965 58308 7433 7189 2395 756 2572 10494 199 85 91 217
## 12 13 14 15 16 17 18 19 20
## 59 1996 876 1522 304 52 885 768 771
This is an interesting histogram that shows the purpose of a personal loan. Code number 1 (Debt Consolidation) has almost 60K records surpassing all the other listing catogories.
The income range follows a normal distribution, with most borrowers having $25000~$49000 and $50000~$74999. Interestingly some borrowers have $0 income. Also there doesn’t exist a ‘higher than $100000’ bucket, which is not uncommon in high-tech regions such as California.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
As a requirement of Prosper loan application, the debt to income ratio should be less than 0.5 to qualify. Most of the loan cases in the dataset satisfies this rule with some outliers. In most cases, the debt to income ratio for consumers ranges from 0.1 to 0.3. This is a useful indicator for people who are interested in applying for a Prosper loan.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 660.0 680.0 685.6 720.0 880.0 591
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 660.0 680.0 685.6 720.0 880.0 591
The plot dipicts the lower range and upper range of credit scores of borrowers. Lower is shown is red while upper in blue.
The distribution of credit scores of borrowers is close to normal distribution, peaking around 680 to 700.
##
## 2005 Q4 2006 Q1 2006 Q2 2006 Q3 2006 Q4 2007 Q1 2007 Q2 2007 Q3 2007 Q4
## 22 315 1254 1934 2403 3079 3118 2671 2592
## 2008 Q1 2008 Q2 2008 Q3 2008 Q4 2009 Q2 2009 Q3 2009 Q4 2010 Q1 2010 Q2
## 3074 4344 3602 532 13 585 1449 1243 1539
## 2010 Q3 2010 Q4 2011 Q1 2011 Q2 2011 Q3 2011 Q4 2012 Q1 2012 Q2 2012 Q3
## 1270 1600 1744 2478 3093 3913 4435 5061 5632
## 2012 Q4 2013 Q1 2013 Q2 2013 Q3 2013 Q4 2014 Q1
## 4425 3616 7099 9180 14450 12172
The first plot is not in chronological order. After a rearrange of year and quarter, we have the second time series graph that depicts ups and downs in number of loan listings the company has. Prosper was founded in year 2005, which explains the low count in the fourth quarter of 2005. The business gradually grew in the next few years, then had a very sharp decline in 2008 because of the global financial crisis. The recovery took some time when the number of loans climbed up in 2009 and 2010. By the end of 2011, the company regained the history-high record. Overall, the growth was most impressive in year 2013 when the loan amount rocketed to almost 15,000 in the fourth quarter.
There are 113,937 observations of 81 variables in the Prospect Loan dataset. In this analysis we are going to focus on fields - Term, BorrowerRate, ProsperScore, EmploymentStatus, IsBorrowerHomeowner, CurrentCreditLines, CurrentDelinquencies, AmountDelinquent, DebtToIncomeRatio,StatedMonthlyIncome,LoanOriginalAmount, LoanOriginationQuarter, CreditScoreRangeLower and CreditScoreRangeUpper etc.
Term The length of the loan expressed in months.
BorrowerRate The Borrower’s interest rate for this loan.
ProsperScore A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score.
Applicable for loans originated after July 2009. (Note that there are 29084 missing values.)
CreditScoreRangeLower/Upper The lower/upper value representing the range of the borrower’s credit score as provided by a consumer credit rating agency.
AmountDelinquent Dollars delinquent at the time the credit profile was pulled.
DebtToIncomeRatio The debt to income ratio of the borrower at the time the credit profile was pulled. This value is Null if the debt to income ratio is not available. This value is capped at 10.01 (any debt to income ratio larger than 1000% will be returned as 1001%).
CurrentCreditLines Number of current credit lines at the time the credit profile was pulled.
ListingCategory The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans
LoanOriginalAmount The origination amount of the loan.
The term of loan are mostly 36 and 60 months, in other words 3 or 5 years.
Almost half of the records are current loans. 6.4% are past due.
The IQR of the borrower rate is 14% ~ 32%, with 32% being the most popular rate of all loans.
Prosper score follows a pseudo normal distribution in even and odd sets of scores respectively. It’s interesting that there are overall more even scores than odd scores in the middle range 3~8.
The credit score median is normal distributed, peaking around 680 - 700.
Almost 80% of borrowers don’t have any delinquency.
The borrowers’ credit Lines is right skewed, peaking around 7 to 10.
75 Quantile of loan amount is $12,000. There are several peaks at $3,000, $10,000, and $15,000.
More than half of the listings are created for the purpuse of Debt Consolidation, followed by Home Improvement, Business and Auto.
Most borrowers have an income of $25000 to 75000 annually.
The number of loans at Prosper follows a chronological pattern.
The most important features in the dataset are BorrowerRate and ProsperScore. For a consumer seeking for a peer-to-peer personal loan, the interest rate will be the key factor that one considers when shopping around. I am interested in finding the relationship between BorrowerRate and ProsperScore, which is solely based on Prosper history of the consumer who is applying. For that reason, there are other factors from the credit history perspective that also contributes to the BorrowerRate, such as CreditScoreRangeLower/Upper.
The term, consumer’s credit history, time of application and monthly income will help support my investigation.
I regrouped LoanStatus ‘Past Due’ section into one, because each sub-category of ‘Past Due’ has too few records compared to other categories. The final histogram of 5 main categories is more meaningful to the audience.
Moreover, I rearranged the LoanOriginationQuarter variable so that the value of the new catogories are chronological, which provides convenience and clarity in the time series histogram. We could see how Prosper’s loan business fluctuate over time.
The ProsperScore is supposed to range from 0 to 10, but I found some outliers valued at 11. These could be bador old data(earlier than July 2009), so I excluded them from the histogram by setting the limits on x axis. In this way, we have a clean vision of the distribution of ProsperScore.
At the beginning of this section, I create a right subset function that scans the string from the right side. Also I create two new variables. The first is the CreditScoreMedian calculated from CreditScoreRangeLower and Upper. The second is CreditHistory calculated from FirstRecordedCreditLine. CreditHistory is on the level of years.
The above ggpairs graph only reflect the 56,576 current loans with Prosper. It still provides a full insight on the most up-to-date loan information that assists future analysis between two and multiple variables. As we can see, the ggcorr graph provides a clear illustration on variable correlation. Pairs in dark red and blue will be focused on in the follow-up analysis.
Prosper interest rate is lower when a borrower has higher ProsperScore, proving the strong linear correlation of -0.7.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3667 5167 6153 7447 1750000
The stated monthly income axis is tailored as the majority of monthly income are well below $20,000. From the second scatterplot, there doesn’t exist a strong correlation between income and borrower rate.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0577 0.1314 0.1760 0.1838 0.2310 0.3304
The credit score axis is adjusted to normal credit range by common sense. It looks from the graph that CreditScore and BorrowerRate are negatively correlated at level of -0.5 by the Pearson correlation test.
The boxplot shows the relationship between prosper score and employment status. Full-time consumers have the highest IQR and mean of prosper score, while Self-employed category have the lowest. As we can observe, the distribution of full-time category is skewed, since people may have all kinds of occupations whose income and credit habits vary a lot. It is surprising that people who are ‘Retired’ have higher prosper score than Employed and Self-employed categories. I wonder what the differenc is between category Employed and Full-time.
How active is each type of employment in credit cards and loans activities? The first boxplot tells us that the part-time group is the least active in having loans and openning credit cards. The Employed group have the most outliers, followed by Full-time and Self-employed. It is understandable that financial institutions are more willing to open credit lines for small business or consumers who have a stable income.
Zooming in to credit lines of 0~25, it’s clear that Employed consumers have the highest credit lines open of about 11, while Not-employed and Part-time have the lowest average of about 7.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.0 0.0 984.5 0.0 463900.0 7622
This jitter plot graph shows that consumers who have are homeowners have a longer tailed delinquent distribution, thus slightly higher delinquency amount if exists than those who are mortgage care-free.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
Prosper requires a consumer to have debt to income ratio of less than 50% to qualify for the loan application. From the first graph there are lots of outliers above 50%, even up to 1001% where the value is capped. Zooming in, we can see there is a full range of debt to income ratio from 0 to 60% and 70% for consumers with a credit history of 10 to 40 years. The peak is at 20~25 years of credit history. Let’s say a consumer has the first credit card at age 20. Adding 20 years makes him 40 plus, which is about the age most people have a home and kids to supply. It makes sense these consumers have more debts at the peak of their financial burden and responsibilities.
Next we want to build a simple linear regression model to predict the borrower rate, using prosper score as the only predictor.
##
## Calls:
## m1: lm(formula = I(BorrowerRate) ~ I(ProsperScore), data = subset(loan.lm1,
## !is.na(ProsperScore)))
##
## ===================================
## (Intercept) 0.317***
## (0.001)
## I(ProsperScore) -0.020***
## (0.000)
## -----------------------------------
## R-squared 0.422
## adj. R-squared 0.422
## sigma 0.057
## F 61989.938
## p 0.000
## Log-likelihood 123078.549
## Deviance 273.094
## AIC -246151.098
## BIC -246123.052
## N 84853
## ===================================
##
## Pearson's product-moment correlation
##
## data: BorrowerRate and ProsperScore
## t = -248.98, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6536072 -0.6458311
## sample estimates:
## cor
## -0.6497361
We can observe a negative relationship between the borrower rate and prosper score. The Pearson correlation test also indicates a negative linear relationship between the two variables(correlation = -0.65).
As for the linear model, the explanatory variable ProsperScore has a p value much less thatn 0.05, and it accounts for 42.2% of the variance in BorrowerRate. The parameter of prosper score is -0.02, meaning that as prosper score increases by one unit, the borrower rate decreases by 0.02, which is 2%.
In the next section we are going to include more credit-related and loan-related variables that could explain BorrowerRate.
First, prosper score has a negative positive relationship with interest rate. The higher the prosper score, the lower(the better) the interest rate.
Second, there doesn’t exist a noticable linear correlation between stated monthly income and borrower rate. As we see the pearson correlation coefficient is as low as 0.1.
Additionally, full-time and employed consumers have the highest average prosper score and also the highest average number of open credit lines. Self-employed consumers have high number of credit lines, most likely for business reasons.
A borrower who already has mortgage tend to have more amount delinquency as the pressure of mortgage and household must be high.
Borrowers having 20~30 years of credit history have the highest debt to income ratio, with some outliers rocketing up to more than 1000%. The ratio becomes much lower for people who have 40 years of credit history, making them about 60 years old.
The debt to income ratio of consumers is very concentrated around the area of credit history ranging 10 to 35. A negative relationship can be observved from the plotting as well. As a requirement to apply for personal loans at Prosper, a borrower need to have minimum two years of credit history, and the reported average value is 11, which are both verified from the graph. As these two variables are vital criteria in loan application, we are going to investigate them more in the next modeling section.
## # A tibble: 10 x 4
## BorrowerState loan_count loan_mean BorrowerStateFull
## <fct> <int> <dbl> <chr>
## 1 CA 14717 8974. california
## 2 TX 6842 9088. texas
## 3 NY 6729 8833. new york
## 4 FL 6720 8207. florida
## 5 IL 5921 8396. illinois
## 6 "" 5515 5412. <NA>
## 7 GA 5008 8363. georgia
## 8 OH 4197 8078. ohio
## 9 MI 3593 7645. michigan
## 10 VA 3278 8971. virginia
Grouped the loan data from listing level to state level, calculating the number of loan for each state as loan_count and average loan amount as loan_mean from LoanOriginalAmount.
The printed data frame is ordered by decending loan_count. The top 5 states having the highest number of loan counts are California, Texas, New York, Florida and Illinois.
Below is a geographical heatmap of the number of Prosper loans in the United States in Prosper history, which presents the same information as the above data.
## # A tibble: 6 x 4
## BorrowerState loan_count loan_mean BorrowerStateFull
## <fct> <int> <dbl> <chr>
## 1 DC 382 10126. district of columbia
## 2 NJ 3097 9529. new jersey
## 3 MA 2242 9515. massachusetts
## 4 NH 551 9463. new hampshire
## 5 AK 200 9312. alaska
## 6 MD 2821 9250. maryland
Additionally, the second heat map dipicts the average loan_mean for each state. It’s shown that although the states on the east coast have low loan counts, their loan amount is very high. This interesting combination is probably due to the distance from the Prosper hub on the west coast but also the fast-growing economy on the east coast.
Next is BorrowerRate over loan origination year in an effort to check if there is an trend in the interest rate of borrowers over time.
We only observe the top 6 states with the highest loan_count here. Variable LoanOrigYear is created from LoanOriginationQuarter by only capturing the year the loan was created.
This faceted boxplot provides an amazing illustration on how the BorrowerRate fluctuates over time, in other words, in the general economy background.
The financial crisis really hit the economy and the personal loan market in 2008 and 2009, forcing vendors like Prosper lower their interest rate to attract borrowers. From the perspective of the states, Texas and New York didn’t do very well in particular during the crisis. In comparison, California, Gorgia and Illinois suffered less hit.
The loan business came back to live in 2011 and 2012 as the economy started to revitalize. We can observe a very good performance for New York and Texas.
Similarly, the jitter point graph also indicates much less business in 2008 and fast growth in 2011 and 2012 as the density of colors change. In particular, we see long term loans as shown in light blue starting to boom in 2011. This also relates back to the previous boxplot that in 2011 and 2012 the average of BorrowerRate was high.
Use Prosper history information as well as loan-related information as predictors for BorrowerRate.
##
## Calls:
## m1: lm(formula = I(BorrowerRate) ~ I(ProsperScore) + I(CreditScoreMedian),
## data = loan)
## m2: lm(formula = I(BorrowerRate) ~ I(ProsperScore) + I(CreditScoreMedian) +
## Term, data = loan)
## m3: lm(formula = I(BorrowerRate) ~ I(ProsperScore) + I(CreditScoreMedian) +
## Term + LoanOrigYear, data = loan)
## m4: lm(formula = I(BorrowerRate) ~ I(ProsperScore) + I(CreditScoreMedian) +
## Term + LoanOrigYear + LoanOriginalAmount, data = loan)
## m5: lm(formula = I(BorrowerRate) ~ I(ProsperScore) + I(CreditScoreMedian) +
## Term + LoanOrigYear + LoanOriginalAmount + StatedMonthlyIncome,
## data = loan)
##
## ========================================================================================================
## m1 m2 m3 m4 m5
## --------------------------------------------------------------------------------------------------------
## (Intercept) 0.645*** 0.639*** 52.623*** 48.522*** 48.513***
## (0.003) (0.003) (0.259) (0.276) (0.276)
## I(ProsperScore) -0.017*** -0.017*** -0.019*** -0.018*** -0.018***
## (0.000) (0.000) (0.000) (0.000) (0.000)
## I(CreditScoreMedian) -0.000*** -0.000*** -0.000*** -0.000*** -0.000***
## (0.000) (0.000) (0.000) (0.000) (0.000)
## Term 0.000*** 0.001*** 0.001*** 0.001***
## (0.000) (0.000) (0.000) (0.000)
## LoanOrigYear -0.026*** -0.024*** -0.024***
## (0.000) (0.000) (0.000)
## LoanOriginalAmount -0.000*** -0.000***
## (0.000) (0.000)
## StatedMonthlyIncome 0.000***
## (0.000)
## --------------------------------------------------------------------------------------------------------
## R-squared 0.5057 0.5066 0.6653 0.6717 0.6718
## adj. R-squared 0.5057 0.5066 0.6653 0.6717 0.6717
## sigma 0.0525 0.0524 0.0432 0.0428 0.0428
## F 43399.7450 29044.6500 42164.2025 34720.3395 28939.1491
## p 0.0000 0.0000 0.0000 0.0000 0.0000
## Log-likelihood 129702.0207 129784.8619 146246.3116 147066.2547 147072.0398
## Deviance 233.6208 233.1651 158.1822 155.1545 155.1333
## AIC -259396.0414 -259559.7238 -292480.6232 -294118.5094 -294128.0796
## BIC -259358.6467 -259512.9804 -292424.5312 -294053.0687 -294053.2901
## N 84853 84853 84853 84853 84853
## ========================================================================================================
The six vairables accounts for 67.18% of the variation of the BorrowerRate. The p value is less than 0.05 for each predictor, so we are confident to reject the null hypothesis that coefficients are 0. Therefore, the coefficients are valid in the above chart. In other words, the 6 variables are all meaningful in prediction of the borrower interest rate at Prosper.
A few adjustments are made to the second iteration of the model.
Use log transformation on large values CreditScoreMedian, LoanOriginalAmount, StatedMonthlyIncome. Adjust if the value before log is less than 1.
Use year as unit for Term instead of month.
Change LoanOrigYear from numeric to character because the quantity doesn’t contribute to the ProsperScore by common sense.
Add three credit-related predictors of interest - CreditHistory, DebtToIncomeRatio, IsBorrowerHomeowner.
##
## Calls:
## m1: lm(formula = I(BorrowerRate) ~ I(ProsperScore) + I(log(CreditScoreMedian)),
## data = loan)
## m2: lm(formula = I(BorrowerRate) ~ I(ProsperScore) + I(log(CreditScoreMedian)) +
## I(Term/12), data = loan)
## m3: lm(formula = I(BorrowerRate) ~ I(ProsperScore) + I(log(CreditScoreMedian)) +
## I(Term/12) + as.factor(LoanOrigYear), data = loan)
## m4: lm(formula = I(BorrowerRate) ~ I(ProsperScore) + I(log(CreditScoreMedian)) +
## I(Term/12) + as.factor(LoanOrigYear) + log(LoanOriginalAmount),
## data = loan)
## m5: lm(formula = I(BorrowerRate) ~ I(ProsperScore) + I(log(CreditScoreMedian)) +
## I(Term/12) + as.factor(LoanOrigYear) + log(LoanOriginalAmount) +
## log(StatedMonthlyIncome + 1), data = loan)
## m6: lm(formula = I(BorrowerRate) ~ I(ProsperScore) + I(log(CreditScoreMedian)) +
## I(Term/12) + as.factor(LoanOrigYear) + log(LoanOriginalAmount) +
## log(StatedMonthlyIncome + 1) + DebtToIncomeRatio, data = loan)
## m7: lm(formula = I(BorrowerRate) ~ I(ProsperScore) + I(log(CreditScoreMedian)) +
## I(Term/12) + as.factor(LoanOrigYear) + log(LoanOriginalAmount) +
## log(StatedMonthlyIncome + 1) + DebtToIncomeRatio + CreditHistory,
## data = loan)
## m8: lm(formula = I(BorrowerRate) ~ I(ProsperScore) + I(log(CreditScoreMedian)) +
## I(Term/12) + as.factor(LoanOrigYear) + log(LoanOriginalAmount) +
## log(StatedMonthlyIncome + 1) + DebtToIncomeRatio + CreditHistory +
## IsBorrowerHomeowner, data = loan)
##
## ======================================================================================================================================================================
## m1 m2 m3 m4 m5 m6 m7 m8
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept) 2.642*** 2.645*** 2.672*** 2.574*** 2.593*** 2.624*** 2.637*** 2.647***
## (0.019) (0.019) (0.015) (0.015) (0.015) (0.016) (0.016) (0.016)
## I(ProsperScore) -0.017*** -0.017*** -0.019*** -0.018*** -0.018*** -0.018*** -0.018*** -0.018***
## (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000)
## I(log(CreditScoreMedian)) -0.358*** -0.359*** -0.359*** -0.332*** -0.333*** -0.344*** -0.346*** -0.347***
## (0.003) (0.003) (0.002) (0.002) (0.002) (0.002) (0.002) (0.003)
## I(Term/12) 0.002*** 0.010*** 0.013*** 0.013*** 0.013*** 0.013*** 0.013***
## (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000)
## as.factor(LoanOrigYear): 2010/2009 0.008*** 0.010*** 0.010*** 0.009*** 0.009*** 0.009***
## (0.001) (0.001) (0.001) (0.001) (0.001) (0.001)
## as.factor(LoanOrigYear): 2011/2009 -0.001 0.007*** 0.006*** 0.004*** 0.004*** 0.004***
## (0.001) (0.001) (0.001) (0.001) (0.001) (0.001)
## as.factor(LoanOrigYear): 2012/2009 -0.013*** -0.005*** -0.005*** -0.007*** -0.007*** -0.007***
## (0.001) (0.001) (0.001) (0.001) (0.001) (0.001)
## as.factor(LoanOrigYear): 2013/2009 -0.066*** -0.053*** -0.053*** -0.054*** -0.054*** -0.054***
## (0.001) (0.001) (0.001) (0.001) (0.001) (0.001)
## as.factor(LoanOrigYear): 2014/2009 -0.087*** -0.073*** -0.073*** -0.072*** -0.072*** -0.072***
## (0.001) (0.001) (0.001) (0.001) (0.001) (0.001)
## log(LoanOriginalAmount) -0.012*** -0.011*** -0.012*** -0.012*** -0.012***
## (0.000) (0.000) (0.000) (0.000) (0.000)
## log(StatedMonthlyIncome + 1) -0.002*** 0.003*** 0.003*** 0.002***
## (0.000) (0.000) (0.000) (0.000)
## DebtToIncomeRatio 0.013*** 0.012*** 0.012***
## (0.001) (0.001) (0.001)
## CreditHistory 0.000*** 0.000***
## (0.000) (0.000)
## IsBorrowerHomeowner 0.001*
## (0.000)
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared 0.5076 0.5086 0.6964 0.7053 0.7061 0.7108 0.7111 0.7112
## adj. R-squared 0.5076 0.5086 0.6964 0.7052 0.7061 0.7108 0.7111 0.7111
## sigma 0.0524 0.0523 0.0411 0.0405 0.0405 0.0397 0.0397 0.0397
## F 43734.2730 29276.8924 24325.9049 22558.1288 20382.1758 17328.0006 15908.9086 14686.2339
## p 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## Log-likelihood 129867.0694 129956.3904 150382.5630 151641.7652 151759.6549 140104.2494 140147.8055 140150.2032
## Deviance 232.7137 232.2243 143.4886 139.2925 138.9059 122.4809 122.3434 122.3358
## AIC -259726.1389 -259902.7809 -300745.1260 -303261.5304 -303495.3099 -280182.4989 -280267.6110 -280270.4064
## BIC -259688.7442 -259856.0375 -300651.6393 -303158.6949 -303383.1258 -280062.1349 -280137.9882 -280131.5249
## N 84853 84853 84853 84853 84853 77557 77557 77557
## ======================================================================================================================================================================
Similar to the first iteration,the p value is less than 0.05 for each predictor, indicating that we are confident to reject the null hypothesis that coefficients are zero. This means that all credit-related additions are meaningful to the model. Moreover, there is an overall improvement in R squared from 67.18% to 71.12%. If we look closer, there is actually an improvement on each level of predictor addition, meaning that the variable transformation and addition are worthwhile.
As we can see from the jitter point graph ‘Trend in interest rates for different loan terms’, the same conclusion can be drawn as the one variable qplot in the univariate plot section.
On top of that, time series observation is made when we look at the BorrowerRate over time.
Also, in terms of Loan OriginationQuarter, the set of faceted boxplots ‘Trend in interest rates in top 6 states’ has convinced the economy influence on Prosper’s loan business, as is initially observed in the univariate section as well.
There are two interesting observations.
On the one hand, it’s very helpful to observe the loan count as well as the loan average in the geographical graph of the United States. We know that most loans are from California where Prosper is hubbed, while the highest average loan is DC, an east-coast area. I wonder what reasons are behind that drive people to seek for large amounts of personal loans. From my perspective, California is also a high-salary state with many tech companies. Merely thinking from the economy perspective is not enough to answer this question intuitively.
On the other hand, the fluctuation of BorrowerRate over time follows similar trend as the number of loans over time. This is not surprising but very interesting to think about. More research could be initiated to look more into each State and different types of loans to see what levels of BorrowRate there are.
I created a multivariate model and made a modification on it.More credit-related variables are included to increase the R squared, so that up to 71.12% variation of BorrowerRate is explained. This model is strong enough,having Prosper history, loan features and credit history, although more predictors should be added to strengthen the model even more. Some other variables that could explain the BorrowerRate potentially are TotalInquiries, OpenRevolvingAccounts and PublicRecordsLast12Months.
This is a univariate distribution of BorrowerRate. Refined the theme and axis ranges.The highest frequency of interest rate of a consumer’s loan with Prosper is 32%, followed by range 14%~20%.
This is an upgrade on the negative relationship between interest rate and credit score, also adding a fitted line and layer of loan term. The fitted curve is close to linear, predictable in credit score range 550~850. The darker layer of dots are above the lighter layer, indicating that longer loan term usually means higher interest rate.
This is a geographical heatmap of the average Prosper loan amount in the United States. California has the highest loan count overall, followed by Texas, and Florida etc. There are very few loan listings originated from states in the middlewest.
The Prosper loan dataset is a large and resourceful one that contains 114K observations and more than 80 variables. Some variables have missing values or outliers that complicate the analysis here and there. I made a lot of efforts digging into the distribution and definition of each variable to find meaningful trend, features and relationship among data.
Luckily, I was able to apply all of the techniques I learned from the course to this project. I also researched online for more interesting graphic methods, such as the geographical heatmap. Some plots gave me a surprise, such as the Prosper collected interest and fees, while some others are insightful and meaningful, such as the loan amount and interest rate over time.
Additionally, I was able to build a predictive model on the borrower interest rate by using 9 predictors in three categories - prosper history, loan features and historical credit performance. This model has a high explanation and predictive value for any consumers who are interested in having a personal loan with Prosper.
Nevertheless, more work could be done using this dataset. I could look more into details on the cohorts of a time period, a few states, or one listing category. As for modeling, I could include more variables to increase the explanatory value of the multi-linear model, or I could fit other types of models that may have a higher predictive value.
https://www.nerdwallet.com/blog/loans/prosper-personal-loans-review/
https://stackoverflow.com/questions/7963898/extracting-the-last-n-characters-from-a-string-in-r
https://stackoverflow.com/questions/26665319/removing-na-in-dplyr-pipe
https://cran.r-project.org/web/packages/fiftystater/vignettes/fiftystater.html
https://cran.r-project.org/web/packages/openintro/openintro.pdf
http://sape.inf.usi.ch/quick-reference/ggplot2/colour
https://stackoverflow.com/questions/28427572/manipulating-axis-titles-in-ggpairs-ggally
https://briatte.github.io/ggcorr/
https://stackoverflow.com/questions/40406837/wrap-axis-labels-in-correlation-matrix